23 research outputs found

    Practical algorithms for selection on coarse-grained parallel computers

    Full text link

    Efficient Evaluation of Sparse Data Cubes

    Full text link
    Available at www.springerlink.com. Computing data cubes requires the aggregation of measures over arbitrary combinations of dimensions in a data set. Efficient data cube evaluation remains challenging because of the potentially very large sizes of input data sets (e.g., in the data warehousing context), the well-known curse of dimensionality, and the complexity of the queries that must be supported. This paper proposes a new dynamic data structure called SST (Sparse Statistics Trees) and a novel, interactive, and fast cube evaluation algorithm called CUPS (Cubing by Pruning SST), which is especially well suited to computing aggregates in cubes whose data sets are sparse. SST stores only the aggregations of non-empty cube cells rather than the detailed records. Furthermore, it retains in memory the dense cubes (a.k.a. iceberg cubes) whose aggregate values are above a threshold; sparse cubes are stored on disk. This allows fast, approximate answers to queries. If users desire more refined answers, the related sparse cubes are aggregated. SST is incrementally maintainable, which makes CUPS suitable for data warehousing and for the analysis of streaming data. Experimental results demonstrate the excellent performance and good scalability of our approach.
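
    The following is a minimal illustrative sketch, in Python, of the core idea the abstract describes: aggregate only the dimension-value combinations that actually occur (so empty cells are never materialized), keep cells whose aggregate exceeds a threshold in memory as dense "iceberg" cells, and treat the rest as sparse cells that the real system would spill to disk. It is not the SST/CUPS implementation; the class and method names (SparseCubeSketch, add_record, query) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

class SparseCubeSketch:
    """Illustrative sketch of sparse cube aggregation (not the SST/CUPS code).

    Cells whose aggregate reaches `threshold` are considered dense and kept in
    `dense` (in memory); the remaining sparse cells would live on disk in the
    real system, here they simply stay in `sparse`.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.dense = {}                   # cell -> aggregate (kept in memory)
        self.sparse = defaultdict(float)  # cell -> aggregate ("on disk")

    def add_record(self, dims, measure):
        # A cell is a frozenset of (dimension, value) pairs; update every
        # non-empty subset of the record's dimensions (the cuboid lattice).
        items = list(dims.items())
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                cell = frozenset(combo)
                total = self.dense.pop(cell, None)
                if total is None:
                    total = self.sparse.pop(cell, 0.0)
                total += measure
                # Promote to the in-memory (dense) store once above threshold.
                (self.dense if total >= self.threshold else self.sparse)[cell] = total

    def query(self, dims):
        cell = frozenset(dims.items())
        # Fast approximate path: answer from memory; a refined answer would
        # additionally aggregate the related sparse cells from disk.
        return self.dense.get(cell, self.sparse.get(cell, 0.0))


# Toy usage with a hypothetical sales-style schema.
cube = SparseCubeSketch(threshold=100.0)
cube.add_record({"region": "EU", "product": "laptop"}, measure=60.0)
cube.add_record({"region": "EU", "product": "phone"}, measure=50.0)
print(cube.query({"region": "EU"}))   # 110.0, promoted to the dense store
```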

    A scalable parallel subspace clustering algorithm for massive data sets

    No full text
    Clustering is a data mining problem that finds dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with the database size as well as with the high dimensionality of the data set. Further, these algorithms need to discover clusters embedded in subspaces of a high-dimensional space. However, the time complexity of exploring clusters in subspaces is exponential in the dimensionality of the data and is thus extremely compute intensive, so parallelization is the natural choice for discovering clusters in large data sets. In this paper we present a scalable parallel subspace clustering algorithm that embeds both data and task parallelism. We also formulate the technique of adaptive grids and present a truly unsupervised clustering algorithm requiring no user inputs. Our implementation shows near-linear speedups with negligible communication overhead. The use of adaptive grids yields a two-orders-of-magnitude improvement in the computation time of our serial algorithm over current methods, with much better clustering quality. Performance results on both real and synthetic data sets with a very large number of dimensions on a 16-node IBM SP2 demonstrate our algorithm to be a practical and scalable clustering technique.
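
    Below is a minimal, hypothetical Python sketch of the adaptive-grid idea mentioned in the abstract, not the paper's algorithm: partition one dimension with a fine histogram, merge adjacent bins with similar counts into variable-width windows, and keep only windows whose density exceeds a multiple of the uniform expectation; such dense 1-D intervals are the candidate units from which subspace clusters would be grown. The function and parameter names (adaptive_bins, fine_bins, density_factor) and the 20% merge tolerance are illustrative assumptions.

```python
import numpy as np

def adaptive_bins(values, fine_bins=100, density_factor=1.5):
    """Return (low, high) intervals of one dimension that are considered dense.

    Illustrative adaptive-grid construction; thresholds are assumptions,
    not values taken from the paper.
    """
    counts, edges = np.histogram(values, bins=fine_bins)

    # Merge adjacent fine bins with similar counts (within 20%) into windows,
    # so the grid adapts to the data distribution instead of being uniform.
    windows = [[0, 0, counts[0]]]          # [first_bin, last_bin, total_count]
    for i in range(1, fine_bins):
        first, last, total = windows[-1]
        avg = total / (last - first + 1)
        similar = (avg == 0 and counts[i] == 0) or \
                  (avg > 0 and abs(counts[i] - avg) <= 0.2 * avg)
        if similar:
            windows[-1] = [first, i, total + counts[i]]
        else:
            windows.append([i, i, counts[i]])

    # Keep windows denser than `density_factor` times the uniform expectation.
    n, span = len(values), edges[-1] - edges[0]
    dense = []
    for first, last, total in windows:
        width = edges[last + 1] - edges[first]
        if width > 0 and (total / width) > density_factor * (n / span):
            dense.append((edges[first], edges[last + 1]))
    return dense


# Toy usage: one dimension with two dense regions on a sparse uniform background.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2, 0.1, 500), rng.normal(8, 0.1, 500),
                    rng.uniform(0, 10, 100)])
print(adaptive_bins(x))   # dense intervals concentrated near x ~ 2 and x ~ 8
```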

    Driving Scientific Applications by Data in Distributed Environments

    No full text

    A Multidimensional OLAP Engine Implementation in Key-Value Database Systems

    No full text